A Contrasting Evaluation of Deep Learning and Machine Learning Approaches for Identifying the Existence of Fake Health News

Authors: Darshan Deshmukh, Chaitrali Kadam, Tejas Saraf

DOI Link: https://doi.org/10.22214/ijraset.2023.57573

Abstract

The problem of fake news existed before the Internet, but the growth and penetration of the Internet have made it worse. It is even more worrying when the news concerns health. This study proposes content-based models (CBM) and feature-based models (FBM) to address this problem. The two kinds of model differ in their input: the CBM accepts only the news content, while the FBM accepts two readability features in addition to the content. Under each category, two hybrid Deep Learning approaches, CNN-LSTM and CNN-BiLSTM, are compared with five traditional machine learning techniques: Decision Tree, Random Forest, Support Vector Machine, AdaBoost-Decision Tree, and AdaBoost-Random Forest.

I. INTRODUCTION

The way we access and share information has been completely transformed by the Internet. The Internet has greatly benefited society, but it has also made it possible for false information and fake news to spread quickly. The term "fake news" has become increasingly common. It refers to false or altered information that cannot be independently verified; it has been defined as "news that is intentionally and verifiably false" [1] and is disseminated with the goal of deceiving people. Fake news is not new: the "Great Moon Hoax" was a series of articles about the discovery of life on the moon published in the New York Sun in 1835 [2]. High internet penetration, however, has led to the widespread dissemination of information from a variety of sources, including online newspapers, blogs, social media, magazines, and various forums, and this has made it challenging to determine the veracity of published news [3]. For example, the U.S. presidential election of 2016 generated a lot of discussion about fake news [4]. In an Ipsos survey conducted for the Centre for International Governance Innovation (CIGI) in over 25 countries, 86% of users acknowledged that they had come across fake news that they had at first believed to be real [5]. In a Microsoft survey, 60% of Indians reported having seen fake news online, compared with a global average of 57% [6]. Fake news is most prevalent in the political sphere, but it is now spreading to many other domains. For example, reports on the January 2020 Australian wildfires spread a lot of false information about the event [7]. The COVID-19 pandemic added fuel to the fire, with false information circulating about everything from the virus's origin and pathogenicity to treatments and cures.
While dealing with the virus, medical personnel found it extremely challenging to control the spread of false information. Alongside the global pandemic, the World Health Organization (WHO) warned of an "infodemic" caused by the widespread dissemination of incorrect information regarding the virus's cause, spread, treatment, and prevention [8]. For example, a US citizen who was told that chloroquine might be used to treat COVID-19 died after taking the medication [9]. Misinformation is not limited to recently discovered viruses or bacteria; false claims are also spread about other conditions such as cancer (its causes and treatment), autism, dementia, and urological disorders [9], [10], [11], [12], [13]. With Internet penetration very high, over 70% of adults search online for healthcare information; however, the information they find may not always be accurate.

Because fake news about health affects human life directly, its effects in the health sector may be more detrimental than in other areas. The spread of false information can harm patients, raise healthcare costs, and erode trust in providers, among other things. According to a thorough analysis, people who are exposed to false and misleading information about health issues face difficulties in their mental, social, political, and/or financial lives. One piece of fake medical news, for example, caused at least 800 deaths and 5,800 hospital admissions [14].

II. BACKGROUND RESEARCH

The three most researched areas for fake news classification are politics, tourism, and marketing, while health care is the least researched [15]. This research focuses on the healthcare domain and reviews the related literature, because fake news in this field can directly affect people's health, which makes identifying it especially important. Research in this area can be divided into two categories based on the methodology employed: deep learning approaches and traditional machine learning approaches for classifying fake news. A HealthLies dataset was created that included false information and facts about a number of diseases, including AIDS, Zika, cancer, and COVID-19.

Conventional machine learning models trained on the HealthLies dataset were compared with the BERT model, and BERT performed better than all of the conventional models [16]. A Random Forest classifier with an F1 score of 85% was built to identify fake news related to autism [17]. Four deep learning approaches (CNN, LSTM, GRU, and RNN) were compared with conventional machine learning algorithms such as Naive Bayes, Nearest Neighbor, Random Forest, Logistic Regression, AdaBoost, and Neural Network. For the COVID-19 dataset, the results demonstrate that deep learning algorithms outperform conventional machine learning algorithms [18]. Cross-SEAN, a semi-supervised text classification model for detecting false news that learns from relevant external facts and partially generalizes to newly emerging false news, was compared with seven state-of-the-art methods.

According to the findings, Cross-SEAN outperformed the best baseline by 9% on CTF, a sizable COVID-19 Twitter dataset, with an F1 score of 0.95 [19]. The effectiveness of conventional machine learning algorithms for identifying fake news, including Multinomial Naïve Bayes, Support Vector Machine, Logistic Regression, and Random Forest, was compared in [20], taking topical, structural, and semantic patterns into consideration. Both conventional and deep learning techniques were evaluated for identifying fake news on a COVID-19 dataset, and the findings demonstrated that deep learning models are more adept at identifying false information [21]. After taking linguistic and sentiment features into account, a Random Forest classifier was developed to identify fake news about COVID-19 [22]. Using feature selection, a Random tree-based classifier with an F1 score of 94.5% was developed to detect fake tweets linked to the Zika virus [23]. Comparatively more work has been done in domains other than healthcare. For instance, the authors of [24] presented a two-phase method called WELFake that employs supervised machine learning models to identify false news using word embedding over linguistic features.

In the first phase of WELFake, linguistic features were used to verify the veracity of news reports. In the second phase, the word embedding and linguistic feature sets were combined and voting classification was carried out. On articles from the political domain, the WELFake model's accuracy of 96.73% was higher than the maximum accuracies of the CNN and BERT models, which were 92.48% and 93.79%, respectively. An accuracy and F1 score of roughly 94% and 98%, respectively, were obtained with a deep learning model [25]. The performance of traditional machine learning models (Binomial Linear Regression, Naïve Bayes classifier) was compared with that of deep learning models (CNN, LSTM).

III. METHODOLOGY

The proposed research methodology, shown in Fig. 1, builds Content Based and Feature Based models using machine learning and deep learning techniques and compares their performance. The Content Based models use only the news content as input, whereas the Feature Based models use additional readability features alongside the content.

A. Data Categorizing

For the study of fake news in politics, there are many open datasets available; however, datasets in the healthcare field are incredibly rare and small.

The HealthLies dataset, a well-known publicly accessible dataset, comprises 12,267 sentences that are classified as true or false according to whether they contain accurate or inaccurate health information. Sentences from news articles, social media, and health-related websites were gathered to create the HealthLies dataset [30]. The Fake News Healthcare (FNH) dataset [31] is a compiled dataset that focuses on fake news in the healthcare sector. It contains 9,581 labeled news items, of which 1,816 are categorized as fake and 7,765 as legitimate. The dataset also includes extra details such as the URL, article title, and article length.

The FNH dataset contains both fake and genuine news samples. The fake news samples were gathered from websites such as PolitiFact and theonion.com, while the genuine news samples were sourced from CNN, BBC News, and The Atlantic. The FNH dataset was chosen for this study because it provides the additional data used in the model-building process.

B. Augmentation Of Data

The FNH dataset comprises two classes, True and Fake, and is highly imbalanced. Based on the class counts, fake news makes up roughly 19% of the data (1,816 of 9,581 items) and true news roughly 81% (7,765 items), so the ratio of real-news documents to fake-news documents is about 4:1. This kind of imbalance makes it harder to build accurate models and reduces the reliability of the results. Data augmentation techniques are therefore needed to produce a balanced and useful dataset. The simplest option is to randomly replicate samples from the minority class; more generally, augmentation balances the dataset by creating synthetic data from the available data [32]. Augmentation is widely used in computer vision, but it is more challenging in natural language processing (NLP) because it requires knowledge of the text's grammatical structure [33].
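As a rough illustration of the balancing step, the sketch below computes the imbalance ratio from the class counts reported for the FNH dataset and balances the classes by random replication of minority samples. The function names, the placeholder documents, and the fixed seed are assumptions made for the example; the text does not specify the augmentation procedure at this level of detail.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(samples, labels, seed=0):
    """Balance classes by replicating randomly chosen minority samples."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_samples.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_samples, out_labels


# Class sizes taken from the FNH description; the documents are placeholders.
labels = ["true"] * 7765 + ["fake"] * 1816
samples = [f"doc_{i}" for i in range(len(labels))]

ratio = imbalance_ratio(labels)
balanced_samples, balanced_labels = random_oversample(samples, labels)
balanced_counts = Counter(balanced_labels)
print(round(ratio, 2), dict(balanced_counts))
```

Replication only repeats existing articles; text-specific augmentation that generates new sentences would need additional linguistic tooling.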

C. Data Preprocessing

To obtain a valid set of tokens for every article, the dataset had to be cleaned. This was done by removing numbers and special characters. Stop words and punctuation marks add context to the text, so they were retained because word embedding was used for feature extraction. Lemmatization was then used to reduce words to their root forms. The end product of this preprocessing was a list of valid tokens.
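A minimal sketch of these cleaning steps is given below. It assumes that "special characters" means anything other than letters, whitespace, and common punctuation, and it uses a tiny hypothetical lemma table in place of a real lemmatizer; neither choice is taken from the study.

```python
import re

# Hypothetical toy lemma table; a real pipeline would use a full lemmatizer.
TOY_LEMMAS = {"cures": "cure", "days": "day", "vaccines": "vaccine"}

# Punctuation that is kept as context (an assumption for this sketch).
KEEP_PUNCT = ".,!?;:'\"-"


def preprocess(text):
    """Drop digits and special characters, tokenize, then lemmatize."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    # Remove anything that is not a letter, whitespace, or kept punctuation.
    text = re.sub(rf"[^a-z\s{re.escape(KEEP_PUNCT)}]", " ", text)
    # Split into word tokens and single punctuation tokens.
    tokens = re.findall(rf"[a-z]+|[{re.escape(KEEP_PUNCT)}]", text)
    # Stop words such as "in" are deliberately left in place.
    return [TOY_LEMMAS.get(tok, tok) for tok in tokens]


tokens = preprocess("Garlic cures COVID-19 in 3 days!!! #miracle @news")
print(tokens)
```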

D. Feature Extraction

Traditional machine learning models use Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction, while deep learning models use GloVe word embeddings. For FBM, readability features were also extracted.

TF-IDF: A commonly used method for producing document vectors. It turns words into numbers based on each term's importance within a document and across the documents of the available corpus [35].
Word Embedding: Word embedding is used to transform the valid tokens obtained during preprocessing into numeric vectors. It takes the semantic and syntactic information of a word into account, so words with similar contexts and meanings lie close to one another [36]. The embedding matrix for this study was obtained using GloVe (Global Vectors), a popular pre-trained word embedding technique [37]. It was trained on a six-billion-word dataset and has a vocabulary of 400,000 words.
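To make the TF-IDF weighting concrete, here is a small sketch using the textbook definition (term frequency times the logarithm of inverse document frequency). The toy documents are invented for illustration. Library implementations differ in detail; scikit-learn's `TfidfVectorizer`, for instance, smooths the IDF term and L2-normalizes each vector by default, so its numbers would not match this sketch exactly.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, list of TF-IDF vectors) for tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors


# Hypothetical three-document corpus.
docs = [
    ["vaccine", "causes", "autism"],
    ["vaccine", "prevents", "disease"],
    ["garlic", "cures", "disease"],
]
vocab, vectors = tfidf(docs)
for tok, weight in zip(vocab, vectors[0]):
    print(f"{tok:10s} {weight:.3f}")
```

Terms that appear in only one document ("autism") receive a higher weight than terms shared across documents ("vaccine"), which is the behaviour the description above refers to.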

E. Readability Aspects

Fake news in healthcare can be particularly dangerous because it can mislead people into making decisions that harm their health.

By using readability metrics, such as Simple Measure of Gobbledygook (SMOG) score and Type-Token Ratio (TTR), we can assess the readability of healthcare-related content and identify potentially fake news articles that are difficult to comprehend or contain a high proportion of rare or unique words [38].

The specialized language and terminology used in the medical literature make SMOG and TTR especially relevant in the healthcare domain. Because medical terminology can be complicated and hard for the average person to understand, false news spreads and deceives people more easily. For this reason, these features can be used to flag articles that may mislead readers, which matters especially where patient care is concerned.
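The two readability features can be sketched as follows. The SMOG formula is the standard published one (polysyllable count scaled to 30 sentences), and TTR is the number of unique words divided by the total number of words. The syllable counter is a crude vowel-group heuristic and the tokenization is a simple regular expression; both are assumptions for illustration, not the tooling used in the study.

```python
import math
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (an assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def smog_score(text):
    """SMOG grade from the polysyllable count, scaled to 30 sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


def type_token_ratio(text):
    """Unique word types divided by total word tokens."""
    words = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    return len(set(words)) / len(words)


# Hypothetical example text.
sample = "Vitamin therapy eliminates every infection. Doctors hide the cure."
print(round(smog_score(sample), 2), round(type_token_ratio(sample), 2))
```

In a feature-based model, these two numbers would be appended to the content representation of each article.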

IV. MODEL CONSTRUCTION

This section describes the classifier models developed under the two headings of content-based models and feature-based models. To determine which classifier is best for identifying fake news, the proposed deep learning algorithms are compared with traditional machine learning algorithms in both categories.

A. Conventional Machine Learning Frameworks

To evaluate the effectiveness of the CBM and FBM categories, Decision Tree, Random Forest, Support Vector Machine, AdaBoost Decision tree, and AdaBoost Random Forest models were developed.

1. Decision Tree

Decision trees are a modeling technique that uses a hierarchical tree structure to build regression or classification models. As a decision tree is built, the dataset is recursively divided into ever-smaller subsets. The final tree consists of decision nodes and leaf nodes, where each leaf node represents a classification or decision. The root node at the top of the tree corresponds to the best predictor. Splits are chosen using "Information Gain", the reduction in entropy obtained by dividing the data on a feature:

$$\mathrm{Gain}(T, X) = \mathrm{Entropy}(T) - \mathrm{Entropy}(T, X) \tag{3}$$

where T is the target variable, X is the feature to split on, and Entropy(T, X) is the entropy computed after dividing the data along feature X. Decision trees are non-parametric and can process both numerical and categorical data, which allows them to handle large and complicated datasets efficiently without requiring a convoluted parametric framework.
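The information gain criterion can be worked through on a toy example. In the sketch below, entropy is Shannon entropy in bits and Entropy(T, X) is the size-weighted average entropy of the subsets produced by the split; the feature and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Gain(T, X) = Entropy(T) - Entropy(T, X) for one categorical feature."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        split_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - split_entropy


# Hypothetical toy data: does the article contain the word "miracle"?
has_miracle = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
label = ["fake", "fake", "fake", "fake", "true", "true", "true", "true"]
gain = information_gain(has_miracle, label)
print(round(gain, 3))
```

A tree builder evaluates this gain for every candidate feature and splits on the one with the highest value.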

2. Random Forest

Random Forest is a supervised learning technique that uses the "bagging" method to create an ensemble of decision trees. Several learning models are combined in this method to improve the final result. Because the data is sampled with replacement, about one-third of the samples, known as "out-of-bag samples", are left out of each tree's training set and can be used to test the model. The Gini index can be used to evaluate the impurity of the dataset and to choose the feature for the root node. Scikit-learn calculates the Gini importance of each node in each decision tree, assuming that the tree is binary and every internal node has exactly two child nodes.
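Two of the quantities mentioned here are easy to check numerically. The sketch below computes Gini impurity as one minus the sum of squared class proportions, and simulates one bootstrap sample to estimate the out-of-bag fraction, which tends to 1/e (about 0.368, hence "about one-third") for large samples. The sample size and seed are arbitrary choices for the example.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def out_of_bag_fraction(n_samples, seed=0):
    """Fraction of samples never drawn in one bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1.0 - len(drawn) / n_samples


mixed = gini_impurity(["fake", "fake", "true", "true"])   # evenly mixed node
pure = gini_impurity(["fake", "fake", "fake", "fake"])    # pure node
oob = out_of_bag_fraction(10_000)
print(mixed, pure, round(oob, 3))
```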

3. Support Vector Machine

The goal of a Support Vector Machine (SVM) is to find a hyperplane in an N-dimensional space that separates the data points into different categories. The number of features determines the dimension N, and the aim is to find the decision boundary that best classifies new data points. The linear kernel works very well in scenarios with many features, such as text classification tasks, and it is faster than most other kernel functions.
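For a linear kernel, the trained classifier reduces to checking which side of the hyperplane a point falls on. The sketch below shows only that decision rule; the weight vector `w` and bias `b` are hypothetical values chosen by hand, not the output of any training procedure described in the text.

```python
def linear_kernel(x, z):
    """Linear kernel: the dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x + b = 0."""
    return 1 if linear_kernel(w, x) + b >= 0 else -1


# Hypothetical hyperplane in a 3-feature space.
w = [0.8, -0.5, 0.1]
b = -0.2

print(svm_predict(w, b, [1.0, 0.0, 0.0]))  # point on the positive side
print(svm_predict(w, b, [0.0, 1.0, 0.0]))  # point on the negative side
```

Because prediction is a single dot product per document, the cost grows linearly with the number of features, which is one reason linear kernels are practical for high-dimensional text vectors.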

4. AdaBoost

AdaBoost is an ensemble learning technique that combines multiple classifiers to improve classification accuracy. By combining multiple weak classifiers, AdaBoost creates a strong and robust classifier. Its main idea is to adjust the classifier weights and the weights of the training samples at each iteration, so that later classifiers concentrate on the observations that earlier ones misclassified; this enables accurate predictions for uncommon observations. Any machine learning technique that accepts weights on the training set can serve as the base classifier in AdaBoost.
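One boosting iteration can be written out in a few lines. The sketch below follows the classic discrete AdaBoost update for binary labels in {-1, +1}; the predictions of the weak learner are hypothetical, and the weighted error is assumed to lie strictly between 0 and 1. Library implementations (for example the multi-class SAMME variant) use closely related but not identical formulas.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost update for labels in {-1, +1}.

    Returns the classifier weight (alpha) and renormalized sample weights.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (y * p == -1) are up-weighted, others down-weighted.
    new_w = [w * math.exp(-alpha * y * p)
             for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # hypothetical weak learner: last sample is wrong
weights = [1 / 5] * 5
alpha, weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in weights])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.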

B. Models Of Deep Learning

To create a fake news classifier in this category, two deep learning-based models are suggested: Hybrid CNN-LSTM and Hybrid CNN-BiLSTM.

1. CNN-LSTM Model

In this study, a hybrid model that combines CNN and LSTM is proposed, as shown in Fig. 2. A one-dimensional CNN layer (Conv1D) follows the embedding layer. This layer extracts local features using 64 filters with a kernel size of 5 and the ReLU activation function. The result is a set of large feature vectors, which are fed into a MaxPooling1D layer with a pool size of 4 to reduce their dimensions. The pooled feature maps are passed to two LSTM layers, which maintain memory and output the long-term dependent features of the input feature maps. Each LSTM layer has twenty neurons, giving an output dimension of twenty, and uses a linear activation function.
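To see how the layer sizes interact, the sketch below walks a single input through the described stack and counts trainable weights. The hyperparameters (64 filters, kernel size 5, pool size 4, 20 LSTM units) come from the text and the 100-dimensional embedding from the results section; the padded sequence length of 300, stride 1, "valid" padding, and non-overlapping pooling are assumptions, since the text does not state them.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    """Output length of non-overlapping max pooling with 'valid' padding."""
    return length // pool_size


def lstm_params(input_dim, units):
    """Trainable weights in a standard LSTM layer (4 gates, with biases)."""
    return 4 * (units * (input_dim + units) + units)


SEQ_LEN = 300      # hypothetical padded article length (not from the text)
EMBED_DIM = 100    # GloVe dimension reported in the results section
FILTERS, KERNEL, POOL, LSTM_UNITS = 64, 5, 4, 20  # values from the text

conv_len = conv1d_out_len(SEQ_LEN, KERNEL)
pool_len = maxpool1d_out_len(conv_len, POOL)
conv_weights = KERNEL * EMBED_DIM * FILTERS + FILTERS
lstm1_weights = lstm_params(FILTERS, LSTM_UNITS)
lstm2_weights = lstm_params(LSTM_UNITS, LSTM_UNITS)

print("embedding output:", (SEQ_LEN, EMBED_DIM))
print("after Conv1D:    ", (conv_len, FILTERS), conv_weights, "weights")
print("after pooling:   ", (pool_len, FILTERS))
print("after LSTM 1:    ", (pool_len, LSTM_UNITS), lstm1_weights, "weights")
print("after LSTM 2:    ", (LSTM_UNITS,), lstm2_weights, "weights")
```

Under these assumptions the pooling layer shortens the sequence the LSTM layers must process by a factor of four, which is the dimensionality reduction the description refers to.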

2. CNN-BiLSTM Hybrid Model

This model shares the architecture of the hybrid CNN-LSTM model. As shown in Fig. 3, the only modification is the use of a bi-directional LSTM layer in place of the LSTM layers. To obtain the classification, the model consists of multiple layers: the word-embedding layer, CNN layer, max pooling layer, bi-directional LSTM layer, and dense layer. In a bi-directional LSTM the input flows in both directions, so each position carries information about the past as well as the future. As a result, it may yield a more informative representation.
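The idea of reading the sequence in both directions can be illustrated without any deep learning library. In the toy sketch below, a running sum stands in for the LSTM cell (an intentional simplification, not the actual cell); the point is only that the paired state at each position summarizes what came before it and what comes after it.

```python
def run_recurrence(sequence):
    """Toy recurrent cell: the state is a running sum of the inputs so far."""
    state, states = 0, []
    for x in sequence:
        state = state + x
        states.append(state)
    return states


def bidirectional(sequence):
    """Pair each step's forward state with its backward state."""
    forward = run_recurrence(sequence)
    backward = run_recurrence(sequence[::-1])[::-1]
    return list(zip(forward, backward))


features = [1, 2, 3, 4]  # hypothetical pooled feature values
states = bidirectional(features)
print(states)
```

In a real bi-directional layer the forward and backward outputs are typically concatenated, so the output dimension is twice the number of units per direction.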

3. Model Evaluation Parameters

As indicated in Table 1, four metrics were used to assess the model's performance: accuracy, precision, recall, and F1 score. They are computed from four counts: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). A true positive occurs when the model correctly predicts the positive class, and a true negative occurs when the model correctly predicts the negative class. Conversely, a false positive occurs when the model incorrectly predicts the positive class, and a false negative occurs when the model incorrectly predicts the negative class.
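The four metrics follow directly from the four counts, using their standard definitions. The confusion-matrix counts in the sketch below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are right
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

On an imbalanced dataset, precision, recall, and F1 are more informative than accuracy alone, because a model that always predicts the majority class can still score a high accuracy.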

V. RESULTS AND DISCUSSIONS

The CBM and FBM models, using both machine learning and deep learning approaches, were implemented in the Google Colab environment, which provides GPUs for heavy computation. The Python code used the packages Matplotlib, NumPy, Pandas, Scikit-learn, and Keras. GloVe word embedding with 100 dimensions was used for the deep learning models. The deep learning models were built with the sequential model provided by Keras and comprised multiple layers of neurons.
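Pre-trained GloVe vectors are distributed as plain text, with one word followed by its vector components per line, and are usually turned into an embedding matrix indexed by token id before being passed to the embedding layer. The sketch below shows that step on a tiny in-memory stand-in with three dimensions; the word index and vectors are hypothetical. With the real 100-dimensional vectors, the lines would instead be read from a file such as `glove.6B.100d.txt` (assumed here to be the file used, which the text does not state).

```python
def load_glove(lines):
    """Parse GloVe text format: a word followed by its vector components."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with index i; row 0 is padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:  # out-of-vocabulary words stay all-zero
            matrix[idx] = vectors[word]
    return matrix


# Tiny 3-dimensional stand-in for a real 100-dimensional GloVe file.
toy_glove = ["vaccine 0.1 0.2 0.3", "cure -0.4 0.5 0.0"]
# Hypothetical tokenizer output mapping words to integer ids.
word_index = {"vaccine": 1, "cure": 2, "zzzunknown": 3}

vectors = load_glove(toy_glove)
matrix = build_embedding_matrix(word_index, vectors, dim=3)
for row in matrix:
    print(row)
```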

VII. ACKNOWLEDGMENT

The Princess Nourah bint Abdulrahman University in Riyadh, Saudi Arabia, is gratefully acknowledged by the authors for its support of this study.

Conclusion

Fake news is becoming more and more common, so spotting it and stopping its spread are essential. To determine which model performs best, this research compared different models under the categories of feature-based and content-based models. The results show that AdaBoost-RF under FBM is the best performing model, even when compared with the deep learning models. This is consistent with Occam's Razor, which states that simple models are preferred over complex models because of the trade-offs between model simplicity, resource usage, and execution time [39].

References

[1] H. Allcott and M. Gentzkow, “Social media and fake news in the 2016 election,” J. Econ. Perspect., vol. 31, no. 2, pp. 211–236, May 2017, doi: 10.1257/jep.31.2.211.
[2] I. K. Vida, “The ‘great moon hoax’ of 1835,” Hung. J. English Amer. Stud., vol. 18, nos. 1–2, pp. 431–441, 2012.
[3] A. M. Kaplan, “Social media, the digital revolution, and the business of media,” Int. J. Media Manage., vol. 17, no. 4, pp. 197–199, Oct. 2015, doi: 10.1080/14241277.2015.1120014.
[4] J. Albright. (2016). The Election 2016 Micro-Propaganda Machine. Accessed: May 2023. [Online]. Available: https://d1gi.medium.com/theelection2016-micro-propaganda-machine-383449cc1fba
[5] IPSOS. (2019). Fake News: A Global Epidemic Vast Majority (86%) of Online Global Citizens Have Been Exposed to it. Accessed: Mar. 2023. [Online]. Available: https://www.ipsos.com/en-us/news-polls/cigi-fakenews-global-epidemic
[6] BI. (2019). India Has More Fake News Than Any Other Country in the World: Survey. Accessed: Mar. 2023. [Online]. Available: https://www.businessinsider.in/india-has-more-fake-news-than-anyother-country-in-the-world-survey/articleshow/67868418.cms
[7] G. Rannard. (2020). Australia Fires: Misleading Maps and Pictures go Viral. [Online]. Available: https://www.bbc.com/news/blogs-trending51020564
[8] Z. Thomas. (2020). WHO Says Fake Coronavirus Claims Causing ‘Infodemic’. BBC. Accessed: Mar. 2023. [Online]. Available: https://bbc.in/2xUcaAh
[9] Z. Barua, S. Barua, S. Aktar, N. Kabir, and M. Li, “Effects of misinformation on COVID-19 individual responses and recommendations for resilience of disastrous consequences of misinformation,” Prog. Disaster Sci., vol. 8, Dec. 2020, Art. no. 100119, doi: 10.1016/j.pdisas.2020.100119.
[10] S. Loeb, J. Taylor, J. F. Borin, R. Mihalcea, V. Perez-Rosas, N. Byrne, A. L. Chiang, and A. Langford, “Fake news: Spread of misinformation about urological conditions on social media,” Eur. Urol. Focus, vol. 6, no. 3, pp. 437–439, May 2020.
[11] R. Bal, S. Sinha, S. Dutta, R. Joshi, S. Ghosh, and R. Dutt, “Analysing the extent of misinformation in cancer related tweets,” in Proc. Int. AAAI Conf. Web Social Media, vol. 14, no. 1, 2020, pp. 924–928.
[12] S. Kumari, H. K. Reddy, C. S. Kulkarni, and V. Gowthami, “Debunking health fake news with domain specific pre-trained model,” Global Transitions Proc., vol. 2, no. 2, pp. 267–272, Nov. 2021.
[13] C. Melchior and M. Oliveira, “Health-related fake news on social media platforms: A systematic literature review,” New Media Soc., vol. 24, no. 6, pp. 1500–1522, Jun. 2022.
[14] A. Coleman. (2020). ‘Hundreds of Dead’ Because of COVID-19 Misinformation. BBC News. [Online]. Available: http://bbc.com/news/world53755067
[15] S. Rastogi and D. Bansal, “A review on fake news detection 3T’s: Typology, time of detection, taxonomies,” Int. J. Inf. Secur., vol. 22, no. 1, pp. 177–212, Feb. 2023.
[16] G. Chaphekar and J. G. Jetcheva, “HealthLies: Dataset and machine learning models for detecting fake health news,” in Proc. IEEE 8th Int. Conf. Big Data Comput. Service Appl. (BigDataService), Newark, CA, USA, Aug. 2022, pp. 1–8, doi: 10.1109/BigDataService55688.2022.00008.
[17] Y. Zhao, J. Da, and J. Yan, “Detecting health misinformation in online health communities: Incorporating behavioral features into machine learning based approaches,” Inf. Process. Manage., vol. 58, no. 1, Jan. 2021, Art. no. 102390.
[18] W. H. Bangyal, R. Qasim, N. U. Rehman, Z. Ahmad, H. Dar, L. Rukhsar, Z. Aman, and J. Ahmad, “Detection of fake news text classification on COVID-19 using deep learning approaches,” Comput. Math. Methods Med., vol. 2021, pp. 1–14, Nov. 2021.
[19] W. S. Paka, R. Bansal, A. Kaushik, S. Sengupta, and T. Chakraborty, “Cross-SEAN: A cross-stitch semi-supervised neural attention model for COVID-19 fake news detection,” Appl. Soft Comput., vol. 107, Aug. 2021, Art. no. 107393, doi: 10.1016/j.asoc.2021.107393.
[20] S. Dhoju, M. M. U. Rony, M. A. Kabir, and N. Hassan, “Differences in health news from reliable and unreliable media,” in Proc. Companion World Wide Web Conf., May 2019, pp. 1–14.
[21] R. Garg and S. Jeevraj, “Effective fake news classifier and its applications to COVID-19,” in Proc. IEEE Bombay Sect. Signature Conf. (IBSSC), Nov. 2021, pp. 1–6, doi: 10.1109/IBSSC53889.2021.9673448.
[22] S. Khan, S. Hakak, N. Deepa, B. Prabadevi, K. Dev, and S. Trelova, “Detecting COVID-19-related fake news using feature extraction,” Frontiers Public Health, vol. 9, Jan. 2022, Art. no. 788074.
[23] A. Ghenai and Y. Mejova, “Catching Zika fever: Application of crowdsourcing and machine learning for tracking health misinformation on Twitter,” in Proc. IEEE Int. Conf. Healthcare Informat. (ICHI), Park City, UT, USA, Aug. 2017, p. 518, doi: 10.1109/ICHI.2017.58.
[24] P. K. Verma, P. Agrawal, I. Amorim, and R. Prodan, “WELFake: Word embedding over linguistic features for fake news detection,” IEEE Trans. Computat. Social Syst., vol. 8, no. 4, pp. 881–893, Aug. 2021, doi: 10.1109/TCSS.2021.3068519.
[25] C.-M. Lai, M.-H. Chen, E. Kristiani, V. K. Verma, and C.-T. Yang, “Fake news classification based on content level features,” Appl. Sci., vol. 12, no. 3, p. 1116, Jan. 2022.
[26] D. Rohera, H. Shethna, K. Patel, U. Thakker, S. Tanwar, R. Gupta, W.-C. Hong, and R. Sharma, “A taxonomy of fake news classification techniques: Survey and implementation aspects,” IEEE Access, vol. 10, pp. 30367–30394, 2022.
[27] B. Palani and S. Elango, “CTrL-FND: Content-based transfer learning approach for fake news detection on social media,” Int. J. Syst. Assurance Eng. Manage., vol. 14, no. 3, pp. 903–918, Jun. 2023.
[28] O. Ajao, D. Bhowmik, and S. Zargari, “Fake news identification on Twitter with hybrid CNN and RNN models,” in Proc. 9th Int. Conf. Social Media Soc., Jul. 2018, pp. 226–230.
[29] J. A. Nasir, O. S. Khan, and I. Varlamis, “Fake news detection: A hybrid CNN-RNN based deep learning approach,” Int. J. Inf. Manage. Data Insights, vol. 1, no. 1, Apr. 2021, Art. no. 100007, doi: 10.1016/j.jjimei.2020.100007.
[30] E. Dai, Y. Sun, and S. Wang, “Ginger cannot cure cancer: Battling fake health news with a comprehensive data repository,” in Proc. 14th Int. AAAI Conf. Web Social Media, 2020, p. 14.

Copyright

Copyright © 2023 Darshan Deshmukh, Chaitrali Kadam, Tejas Saraf. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Paper Id : IJRASET57573

Publish Date : 2023-12-15

ISSN : 2321-9653

Publisher Name : IJRASET
